big data [English]


Other Languages

InterPARES Definition

n. ~ A dataset – often an aggregation of datasets from different sources, made for different purposes, and with different structures – that is so large that performance requirements become a significant factor when designing and implementing a data management and analysis system.

General Notes

Usage is often ambiguous, as 'big data' is frequently used more for marketing than as a defining concept. The volume of 'big data' varies with context and is not determined by a specific, quantitative measure. "The key feature of the paradigmatic change is that analytic treatment of data is systematically placed at the forefront of intelligent decision-making. The process can be seen as the natural next step in the evolution from the 'Information Age' and 'Information Societies'" (Hilbert, 2013, 4).

Other Definitions

  • OED Web 2018 (†401 s.v. "big data"): big data (also with capital initials) ~ n., Computing: data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.

Citations

  • Arbesman 2013 (†258 ): The term "big data" has been in circulation since at least the 1990s, when it is believed to have originated in Silicon Valley. (†220)
  • Ariely 2013 (†348 ): Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it... (†329)
  • Arthur 2013 (†454 ): Some people like to constrain big data to digital inputs like web behavior and social network interactions; however the CMOs and CIOs I talk with agree that we can’t exclude traditional data derived from product transaction information, financial records and interaction channels, such as the call center and point-of-sale. All of that is big data, too, even though it may be dwarfed by the volume of digital data that’s now growing at an exponential rate. ¶In defining big data, it’s also important to understand the mix of unstructured and multi-structured data that comprises the volume of information. (†624)
  • Arthur 2013 (†454 ): Industry leaders like the global analyst firm Gartner use phrases like “volume” (the amount of data), “velocity” (the speed of information generated and flowing into the enterprise) and “variety” (the kind of data available) to begin to frame the big data discussion. Others have focused on additional V’s, such as big data’s “veracity” and “value.” (†627)
  • Ballard 2014 (†528 p. 2): Big data is defined as “Extracting insight from an immense volume, variety, and velocity of data, in context, beyond what was previously possible.” (This is the IBM definition). Because data has become so voluminous, complex, and accelerated in nature, traditional computing methods no longer suffice. . . . Big data represents the confluence of having new types of available information, new technical capability, and processing capacity, and the desire and belief that it can be used to solve problems that were previously impossible. At the same time, many of the concepts of big data are not new. (†837)
  • Ballard 2014 (†528 p. 3): The origins of the three key dimensions of “big data” (volume, variety, and velocity) were first described by Gartner Group’s Doug Laney in 2001 in a research paper, where he conveyed the impact of e-commerce on data volumes, increased collaboration, and the desire to more effectively use information as a resource. [http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf] (†839)
  • Ballard 2014 (†528 p. 4): big data is most frequently described according to the three dimensions or characteristics that Doug Laney first described: volume, variety, and velocity. More recently, experienced big data practitioners have come to appreciate that veracity is also a key element, that is, the “trust factor” behind the data. Although technologists do not always acknowledge the final “V” (value), gaining insight needs to translate into business value in order for any big data effort to be truly successful. (†840) [A toy profile of these dimensions is sketched after the citations below.]
  • Big Data 2011 (†265 p. iv): Big data – large pools of data that can be captured, communicated, aggregated, stored, and analyzed – is now part of every sector and function of the global economy. ¶ Can big data play a useful economic role? While most research into big data thus far has focused on the question of its volume, our study makes the case that the business and economic possibilities of big data and its wider implications are important issues that business leaders and policy makers must tackle. To inform the debate, this study examines the potential value that big data can create for organizations and sectors of the economy and seeks to illustrate and quantify that value. We also explore what leaders of organizations and policy makers need to do to capture it. (†224)
  • Bohle 2013 (†301 ): The main thing is that in response to the Executive Order there will be three categories of data inventories. In the first are all the records, all the data sets we collect, that can be found in a database now that is publicly available, and we will explain what that data is and where you can find it. These will be a mix of data sets on Data.gov and data sets found on other related web sites, such as those of an agency, a national lab, a contractor, a university, or a researcher’s home page. Secondly, there are data sets that could be made available to the public but are not yet available, so we need to disclose when and where they will be made available. Finally, there will be data sets that are restricted for national security and privacy issue reasons. (†278)
  • Cohen 2014A (†458 ): The definition of "big data" seems to self-refute its utility as a term. In essence, it asserts that it is an advertising term used for ... nothing in particular. The notion of "big" is a relative one in any case, and data being "big" is really asserted to imply that the volume is large relative to something else that is not defined. For a perspective on this, I had a recent forensics case involving something on the order of 10 trillion records. Is that a "big data" case? If so, is a mere 1 trillion records also "big"? How about 10B? 100M, 1M, 100,000? 1,000? 2? In my view, the key to a definition is that it allows differentiation between at least two things - things that are included in the definition and things that are not. There may be a fuzzy boundary of course, but in this case, we cannot clearly identify anything that is in or out of the realm. So I think it should be removed as a definition, or at a minimum, be identified as a term that is not defined formally as of this time. ¶I should also note: " ~ An approach for the analysis of large volumes of information with the intent of discovering and identifying new relationships in different datasets, " Since when is intent relevant to differentiating technical things? And how can intent be determined? If I have the identical volumes using the identical techniques, but a different intent, is it then not big data? ¶"especially datasets that are too large and complex to manipulate or interrogate with standard methods or tools." - but all current "big data" is interrogated and manipulated with "standard methods and tools". I am pretty up to date on this technology, and it is more or less the same set of operations done at smaller scale. ¶If you are going to associate something useful with "big data", I think it is the ability to fuse together diverse datasets and do analysis across those datasets. For example, mapping has become different in kind as it scaled up. Today, we can map, as an example, diseases in marine mammals against temperature, rainfall, animal size, weight, and presence of various other conditions, and map that against pollution levels, fishing activities, vacation periods for people, and many other similar things. As a result, correlations can be asserted and tested far more quickly than they otherwise could have been. But this is not because of a lack of standard methods, but rather because of the translation of many large databases into a common form so that they can be compared and measured against each other. Similarly, 20 years ago, I could simulate a few million sequences in a day, but since I was working against a space that was far larger, that represented only a very small portion of the total set of things that could occur. With the increased scales of computational capacity now available and the ability to rent large volumes of systems for short times relatively inexpensively, I can now do enough simulations to test designs against each other, whereas before, I couldn't even get a good test of a single design. That's what "big data" seems to really be about. (†649) [A sketch of this kind of dataset fusion appears after the citations below.]
  • Cukier, et al. 2013 (†473 ): Big data starts with the fact that there is a lot more information floating around these days than ever before, and it is being put to extraordinary new uses. . . . Big data is about more than just communication: the idea is that we can learn from a large body of information things that we could not comprehend when we used only smaller amounts. ¶ Given this massive scale, it is tempting to understand big data solely in terms of size. But that would be misleading. Big data is also characterized by the ability to render into data many aspects of the world that have never been quantified before; call it “datafication.” For example, location has been datafied, first with the invention of longitude and latitude, and more recently with GPS satellite systems. Words are treated as data when computers mine centuries’ worth of books. ¶ Using great volumes of information in this way requires three profound changes in how we approach data. The first is to collect and use a lot of data rather than settle for small amounts or samples, as statisticians have done for well over a century. The second is to shed our preference for highly curated and pristine data and instead accept messiness: in an increasing number of situations, a bit of inaccuracy can be tolerated, because the benefits of using vastly more data of variable quality outweigh the costs of using smaller amounts of very exact data. Third, in many instances, we will need to give up our quest to discover the cause of things, in return for accepting correlations. With big data, instead of trying to understand precisely why an engine breaks down or why a drug’s side effect disappears, researchers can instead collect and analyze massive quantities of information about such events and everything that is associated with them, looking for patterns that might help predict future occurrences. Big data helps answer what, not why, and often that’s good enough. (†672) [A sketch of this correlation-first approach appears after the citations below.]
  • Hesseldahl 2013 (†267 ): As a useful technology, Big Data is, according to Gartner’s reckoning, at or near the peak in the hype cycle, when expectations are inflated and out of sync with what is deliverable in reality. ¶ Big Data is a catch-all phrase for the notion that, embedded within the vast troves of data that businesses and governments gather, is useful, actionable intelligence that could lead to new efficiency, improved business processes, lower costs, higher profits and so on. All that’s needed, the argument goes, is the will and expertise to perform the relevant analysis on it. (†231)
  • Hesseldahl 2013 (†267 ): When a phrase enters the tech industry lexicon in a big way, and has a good way of crystallizing a seemingly overarching trend, it’s usually not long before the very same phrase becomes a victim of its own overuse and is dismissed as “all hype.” . . . As a useful technology, Big Data is, according to Gartner’s reckoning, at or near the peak in the hype cycle, when expectations are inflated and out of sync with what is deliverable in reality. (†314)
  • Hilbert 2013 (†320 pp. 31-32): As with all previous examples of technology-based innovation for development, also the Big Data paradigm runs through a slow and unequal diffusion process that is compromised by the lacks of infrastructure, human capital, economic resource availability, and institutional frameworks in developing countries. This inevitably creates a new dimension of the digital divide: a divide in the capacity to place the analytic treatment of data at the forefront of informed decision-making. This divide does not only refer to the availability of information, but to intelligent decision-making and therefore to a divide in (data-based) knowledge. (†293)
  • Hilbert 2013 (†320 pp. 4-5): The crux of the “Big Data” paradigm is actually not the increasingly large amount of data itself, but its analysis for intelligent decision-making (in this sense, the term “Big Data Analysis” would actually be more fitting than the term “Big Data” by itself). Independent from the specific peta-, exa-, or zettabytes scale, the key feature of the paradigmatic change is that analytic treatment of data is systematically placed at the forefront of intelligent decision-making. The process can be seen as the natural next step in the evolution from the “Information Age” and “Information Societies” ... to “Knowledge Societies”: building on the digital infrastructure that led to vast increases in information, the current challenge consists in converting this digital information into knowledge that informs intelligent decisions. (†292)
  • Hutchinson 2015 (†632 ): [A four-legged stool: the actual collection of data, data storage, computing power, and software.] . . . It’s all about sorting variables and tracking them, piecing together things that humans can’t. Computers are very good at sifting tremendous quantities of information (with the right software, of course), and that’s the core of big data. (†1430)
  • IBM Four Vs 2013 (†522 ): Infographic illustrating volume (scale of data), variety (different forms of data), velocity (analysis of streaming data), and veracity (uncertainty of data). (†827)
  • ITrust Research Project 9 Proposal, 2013 (†389 1): Examples include the call records of telecommunications companies, the customer purchase orders of large companies such as Amazon and Walmart, and the large volumes of repetitive case files generated by government departments such as Human Resources Development and Immigration. Data are extracted and reassembled or aggregated through various stages of computer-based analysis to produce statistics that can be used by organizations to achieve competitive advantage, enhance services, support planning, etc. (†423)
  • Khramtsovsky 2014 (†459 ): I am unhappy with the proposed definition of the “big data”. According to it, big data is just a lot of bytes. From my point of view, big data are characterized by a need to use non-traditional advanced analytic tools to mine for hidden dependencies and facts rather than by their sheer volume. ¶In my opinion, a heap of standard case files that are processed only individually is just a lot of data, but not a “big data” :) (†652)
  • Kusnetzky 2010 (†337 ): "Big Data" is a catch phrase that has been bubbling up from the high performance computing niche of the IT market. Increasingly suppliers of processing virtualization and storage virtualization software have begun to flog "Big Data" in their presentations. . . . If one sits through the presentations from ten suppliers of technology, fifteen or so different definitions are likely to come forward. . . . The phrase refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities. (†318)
  • Laney 2013 (†481 ): Many vendors and pundits have attempted to augment Gartner’s original “3Vs” from the late 1990s with clever(?) “V”s of their own. However, the 3Vs were intended to define the proportional dimensions and challenges specific to big data. Other “V”s like veracity, validity, value, viability, etc. are aspirational qualities of all data, not definitional qualities of big data. (†718)
  • Magoulas and Lorica 2009 (†338 p. 2): Big Data: when the size and performance requirements for data management become significant design and decision factors for implementing a data management and analysis system. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration. (†319) [A sketch of size as a design factor appears after the citations below.]
  • Manyika, et al. 2011 (†306 1): “Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data — i.e., we don’t define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes). (†285)
  • Mazmanian 2014 (†682 ): The proliferation of government data sets is providing developers with ample fodder for writing useful and potentially profitable applications around census, weather, health, energy, business, agricultural and other information. But as the government makes more and more data discoverable and machine readable, there is the threat that disparate threads can be pieced together in a way that yields information that is supposed to be private. This kind of analysis through the combination of big data sets is called the mosaic effect. (†1560) [A sketch of the mosaic effect appears after the citations below.]
  • McDonald and Lévillé 2014 (†519 ): The concept of big data goes beyond volume. Big data can be defined as large quantities and varieties of data that, due to their fast and sometimes ‘real-time’ availability, require extensive manipulation and mining through the intervention of various non-traditional technologies and tools. When properly mined, combined, manipulated, and analyzed, this information can lead to new and better analytics, easier and more opportunities for validation and faster insights (Vriens, 2013). The concept of big data can be summarized by "three Vs": volume, variety and velocity. (†816)
  • McDonald and Lévillé 2014 (†519 ): Unlike open data initiatives where the objective is to make data produced using public funds available to a wide variety of audiences, the objective of big data initiatives is to help organizations exploit the value of the often voluminous and repetitive data that may reside in their databases to pursue their strategic and operational priorities and goals. These could include achieving a competitive advantage, enhancing services, improving business productivity, generating new economic value, maximizing marketing opportunities, improving outreach, supporting development and planning, or holding the organization to account. (†817)
  • McDonald and Lévillé 2014 (†519 ): Big data initiatives differ from open data initiatives in a number of important ways, beginning with their objectives. The objectives of big data initiatives generally focus on the value of data to the interests of an individual organization or a collection of organizations (e.g. a partnership) in terms of its operational and/or strategic priorities. Open data initiatives on the other hand focus on the value of data to external interests in response to public policies on openness and transparency, coupled with the interests of various industry sectors in using and reusing publically funded data for economic and social development. Big data initiatives often involve the development of new processes and systems designed to extract, combine, manipulate and otherwise exploit data from existing systems, while open data initiatives tend to be based on existing datasets, small databases, and statistics that are packaged for dissemination or access through a portal. Furthermore, big data initiatives are seen in both private and public sector organizations, while open data initiatives tend to be supported by the public sector. (†818)
  • McDonald and Lévillé 2014 (†519 ): The extent to which the data used to support big data initiatives can be trusted is dependent upon the quality, integrity and completeness of the controls that are in place to manage the data that is contributing to these initiatives. These controls are normally established as a result of the steps involved in planning, designing, testing, implementing, maintaining and reviewing the systems responsible for generating and managing the data. Although data security, protection of privacy, and the availability of storage space are significant challenges for big data initiatives, the biggest and perhaps the root challenge lies in building standards and processes that enable the large volumes of data to be manipulated in a manner that will ensure that the results, the final output will have integrity. (†821)
  • Neff 2013 (†318 p. 118): Such data [big data] are neither new nor novel, but more are being generated. This is not so much a new kind of data but huge amounts of it. Assembled and analyzed, it could all be a potential valuable resource. Or it could be what privacy guru Bruce Schneier calls the ‘‘pollution’’ of the information age, a byproduct produced by virtually every technological process, something that is more costly to manage than its value. (†287)
  • Neff 2013 (†318 p. 119): Big data has a rhetoric problem. When people talk about data-driven health innovation they often neglect the power of framing information as “data.” They also assume that everyone thinks about health data the same way they do. Regardless of how it is generated, digital information only becomes data when it is created as such... Data are meaningful because of how someone collects, interprets, and forms arguments with it. Data are not neutral. This is why Lisa Gitelman calls raw data an “oxymoron,” a contradiction in terms that hides the reality of the work involved in creating data. Data, I argue in an article with Brittany Fiore-Silfvast, are so important precisely because people make (or imagine) data function across multiple social worlds. Data are not inherently important or interesting, rather, by definition, data are used to make arguments relative. Put simply, data is only data in the eye of the stakeholder. (†288)
  • Noyes 2016 (†780 ): The term "big data" is itself a relative term, boiling down essentially to "anything that's so large that you couldn't manually inspect or work with it record by record." (†1995)
  • Ohlhorst 2013 (†305 1): Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools. . . . Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set. . . . The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data. (†284)
  • Parry 2013 (†336 ): Mr. Reed [who served as chief technology officer in President Obama’s 2012 campaign] is generally bullish on the power of data. . . . But, with apologies to the technology companies sponsoring the SUNY event, Mr. Reed skewered their industry’s promotion of the buzzwords “Big Data.” “The ‘big’ there is purely marketing,” Mr. Reed said. “This is all fear … This is about you buying big expensive servers and whatnot.” (†317)
  • Provost and Fawcett 2013 (†302 p. 55): One way to think about the state of big data technologies is to draw an analogy with the business adoption of internet technologies. In Web 1.0, businesses busied themselves with getting the basic internet technologies in place so that they could establish a web presence, build electronic commerce capability, and improve operating efficiency. We can think of ourselves as being in the era of Big Data 1.0, with firms engaged in building capabilities to process large data. These primarily support their current operations–for example, to make themselves more efficient... ¶ Similarly, we should expect a Big Data 2.0 phase to follow Big Data 1.0. Once firms have become capable of processing massive data in a flexible fashion, they should begin asking: What can I now do that I couldn’t do before, or do better than I could do before? This is likely to usher in the golden era of data science. The principles and techniques of data science will be applied far more broadly and far more deeply than they are today. (†283)
  • Turk 2014 (†373 ): “Big data” might be a modern buzzword, but the concept goes back as long as people have been collecting records (albeit with our perception of what counts as “big” growing over the years). And while interactive data visualisations and infographics might be new to the world of digital media, the art of illustrating data has a long history. ¶ In its first science exhibition, which opened yesterday, the British Library is showing off some iconic scientific diagrams, new and old, from the country’s science collection. It’s a consideration of how scientific and technological advances have shaped the way we visualise information. Johanna Kieniewicz, lead curator of the exhibition, said they particularly wanted to draw links between data past and present: “Data that is centuries old from collections like ours is now being used to inform cutting edge science.” (†377)
  • UN Global Pulse Big Data 2013 (†319 p. 2): Big Data is characterized by the “3 Vs:” greater volume, more variety, and a higher rate of velocity. A fourth V, for value, can account for the potential of Big Data to be utilized for development. (†289)
  • UN Global Pulse Big Data 2013 (†319 p. 4): If properly mined and analyzed, Big Data can improve the understanding of human behavior and offer policymaking support for global development in three main ways: · 1. Early warning [:] Early detection of anomalies can enable faster responses to populations in times of crisis · 2. Real-time awareness [:] Fine-grained representation of reality through Big Data can inform the design and targeting of programs and policies. · 3. Real-time feedback [:] Adjustments can be made possible by real-time monitoring the impact of policies and programs. (†290) [A sketch of anomaly-based early warning appears after the citations below.]
  • WEF 2013 (†322 p. 2): As the box on the previous page demonstrates ‘big data’ is generally about relationships. More specifically, it is about how individuals relate to other individuals and institutions having access to information about them. (†295)
  • WEF 2013 (†322 p. 1): Yet the discussion of its [Big Data] promise needs to be accompanied by the recognition that there is a personal side to this flood of data: it is created by unique individuals, and as such there are issues related to privacy and data ownership that must be addressed if the potential for good is to be realized. (†296)
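
Illustrative Sketches

The “V” dimensions cited above (volume, velocity, and variety, with veracity and value added by some authors) are a framing device rather than a formula. The following is a minimal sketch only, with invented field names and toy thresholds, of how a dataset might be profiled along those dimensions when judging whether it begins to stress ordinary, single-machine tooling.

    # Illustrative only: a toy profile along the "V" dimensions discussed in the citations.
    # The field names and thresholds are assumptions made for this sketch, not a standard.
    from dataclasses import dataclass

    @dataclass
    class DataProfile:
        volume_bytes: int              # total size of the data at rest
        velocity_rows_per_sec: float   # arrival rate of new records
        variety: set                   # distinct formats or sources feeding the dataset
        veracity: float                # rough 0..1 confidence in the data's accuracy

    def stresses_single_machine(profile: DataProfile,
                                ram_bytes: int = 16 * 2**30,
                                max_rows_per_sec: float = 50_000) -> bool:
        """Heuristic check: does size or speed exceed what one machine handles comfortably?"""
        return (profile.volume_bytes > ram_bytes
                or profile.velocity_rows_per_sec > max_rows_per_sec)

    example = DataProfile(volume_bytes=5 * 2**40,            # roughly 5 TB
                          velocity_rows_per_sec=120_000,
                          variety={"transactions", "clickstream", "call-centre notes"},
                          veracity=0.7)
    print(stresses_single_machine(example))                  # True: volume and velocity exceed the toy limits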
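
Cohen (2014A, above) locates what is genuinely useful about 'big data' in translating many databases into a common form so that they can be compared. A minimal sketch of that fusion step, with a made-up shared key (region and month) and invented values:

    # Illustrative only: two toy "datasets" from different sources, translated into a common
    # keyed form (region, month) so they can be lined up and analysed together.
    # All keys and values are invented for this sketch.
    sea_temperature = {("north", "2021-06"): 14.2, ("north", "2021-07"): 15.1,
                       ("south", "2021-06"): 18.4, ("south", "2021-07"): 19.0}
    strandings = {("north", "2021-06"): 3, ("north", "2021-07"): 5,
                  ("south", "2021-06"): 9, ("south", "2021-07"): 11}

    # Fuse on the shared key: only observations present in both sources line up.
    fused = [(key, sea_temperature[key], strandings[key])
             for key in sorted(set(sea_temperature) & set(strandings))]

    for key, temperature, count in fused:
        print(key, temperature, count)

With real data the same join would run over millions of records drawn from many more sources, but the operation itself does not change.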
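
Cukier et al. (above) argue that big data often settles for correlation rather than causation: conditions associated with an event are counted and ranked without any model of why the event happens. A minimal sketch of that 'what, not why' pattern, using invented maintenance records:

    # Illustrative only: "what, not why". Count which observed conditions most often co-occur
    # with a breakdown, without modelling any causal mechanism. Records are invented.
    from collections import Counter

    events = [
        {"vibration": "high", "temp": "normal", "breakdown": True},
        {"vibration": "high", "temp": "high",   "breakdown": True},
        {"vibration": "low",  "temp": "normal", "breakdown": False},
        {"vibration": "low",  "temp": "high",   "breakdown": False},
        {"vibration": "high", "temp": "normal", "breakdown": True},
    ]

    with_breakdown = Counter()
    totals = Counter()
    for event in events:
        for field in ("vibration", "temp"):
            condition = (field, event[field])
            totals[condition] += 1
            if event["breakdown"]:
                with_breakdown[condition] += 1

    # Rank conditions by how often they accompany a breakdown: a purely correlational signal.
    rates = {c: with_breakdown[c] / totals[c] for c in totals}
    for condition, rate in sorted(rates.items(), key=lambda item: -item[1]):
        print(condition, round(rate, 2))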
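
Magoulas and Lorica (above) define big data by the point at which size becomes a design factor. A minimal sketch of what that shift can look like, assuming a hypothetical CSV file with an 'amount' column: the same total computed with an everything-in-memory design and with a streaming design whose memory use stays constant as the file grows.

    # Illustrative only: the same total computed two ways. The file name and the "amount"
    # column are assumptions for this sketch.
    import csv

    def total_in_memory(path):
        # Small-data design: read every row into a list, then sum.
        # Stops working once the file no longer fits in RAM.
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        return sum(float(row["amount"]) for row in rows)

    def total_streaming(path):
        # Size-aware design: process one record at a time and keep only a running total,
        # so memory use stays constant however large the file grows.
        total = 0.0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += float(row["amount"])
        return total

    # total_streaming("transactions.csv")   # hypothetical file

Neither function is special on its own; the point is that growing volume eventually forces the second design.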
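
The 'mosaic effect' Mazmanian (above) describes can be sketched with two invented releases that each look harmless on their own; linked on shared quasi-identifiers, they single out an individual.

    # Illustrative only: each release looks harmless on its own; joined on shared
    # quasi-identifiers, they single out an individual. All data are invented.
    health_release = [  # de-identified: no names
        {"zip": "20001", "birth_year": 1975, "diagnosis": "diabetes"},
        {"zip": "20001", "birth_year": 1982, "diagnosis": "asthma"},
    ]
    voter_release = [   # public: names but no health data
        {"name": "A. Example", "zip": "20001", "birth_year": 1975},
        {"name": "B. Sample",  "zip": "20009", "birth_year": 1982},
    ]

    # Link the two releases on zip code and birth year.
    for record in health_release:
        matches = [v for v in voter_release
                   if v["zip"] == record["zip"] and v["birth_year"] == record["birth_year"]]
        if len(matches) == 1:   # a unique match pieces the mosaic together
            print(matches[0]["name"], "->", record["diagnosis"])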
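
The 'early warning' use that UN Global Pulse (above) describes rests on spotting anomalies in a stream of observations. A minimal sketch with an invented series of daily counts and an arbitrary threshold, flagging values that fall far outside a recent baseline:

    # Illustrative only: a toy early-warning check over a stream of daily counts
    # (say, calls to a health hotline). Data and threshold are invented.
    from statistics import mean, stdev

    def flag_anomalies(series, window=7, threshold=3.0):
        alerts = []
        for i in range(window, len(series)):
            baseline = series[i - window:i]
            mu, sigma = mean(baseline), stdev(baseline)
            if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
                alerts.append(i)   # index of the observation that looks anomalous
        return alerts

    daily_counts = [102, 98, 105, 110, 97, 103, 101, 99, 104, 250, 108, 102]
    print(flag_anomalies(daily_counts))   # [9]: the spike on day 9 triggers an alert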